Abstract



In this paper we study the space requirement of algorithms that make 
only one (or a small number of) pass(es) over the input data. We study such 
algorithms under a model of data streams that we introduce here. We give 
a number of upper and lower bounds for problems stemming from query- 
processing, invoking in the process tools from the area of communication 
complexity. 

1 Overview 

In this paper we study the space requirement of algorithms that make only one 
(or a small number of) pass(es) over the input data. We study such algorithms 
under a model of data streams that we introduce here. We develop an intimate 
connection between this setting and the classical theory of communication com- 
plexity [AMS96, NK, Yao79]. 

1.1 Motivation 

A data stream, intuitively, is a sequence of data items that can be read once by an 
algorithm, in the prescribed sequence. A number of technological factors motivate 
our study of data streams. 

The most economical way to store and move large volumes of data is on sec- 
ondary and tertiary storage; these devices naturally produce data streams. More- 
over, multiple passes are prohibitive due to the volumes of data on these media; for 
example internet archives [Ale] sell (a large fraction of) the web on tape. This problem
is exacerbated by the growing disparity between the costs of secondary/tertiary
storage and the cost of memory/processor speed. Thus to sustain performance for 
basic systems operations, core utilities are restricted to read the input only once. 
For example, storage managers (such as IBM's ADSM [ADSM]) use one-pass dif- 
ferential backup [ABFLS]. 

Networks are bringing to the desktop ever-increasing quantities of data in the 
form of data streams. For data in networked storage, each pass over the data results 
in an additional, expensive network access. 

The SELECT/PROJECT model of data access common in database systems
gives a "one-pass like" access to data through system calls independent of the
physical storage layout. The interface between the storage manager and the application
layer in a modern database system is well-modeled as a data stream. This data 
could be a filtered version of the stored data, and thus might not be contiguous in 
physical storage.

In a large multithreaded database system, the available main memory is partitioned
between various computational threads. Moreover, operators such as the
hash-based GROUP BY operator compute multiple aggregation results concurrently
using only a single scan over the data. Thus, the amount of memory used
in each thread influences efficiency in two ways: (1) It limits the number of concurrent
accesses to the database system. (2) It limits the number of different computations
that can be performed simultaneously with one scan over the data. Thus
effectively, even a 1 GB machine will provide under 1 MB to each thread when
supporting a thousand concurrent threads, especially after the operating system and
the DBMS take their share of memory.

There is thus a need for studying algorithms that operate on data streams. 

1.2 Scope of the present work 

A data stream is a sequence of data items $x_1, \ldots, x_i, \ldots, x_n$ such that the items
are read once in increasing order of the indices $i$. Our model of computation can
be described by two parameters: the number $P$ of passes over the data stream
and the workspace $S$ (in bits) required in main memory, measured as a function of
the input size $n$. We seek algorithms for various problems that use one or a small
number of passes and require a workspace that is smaller than the size of the input. 
Our model does not require a bound on the computation time, though for all the 
algorithms we present, the time required is small. In query settings, the size of the 
output is typically much smaller than the size of the workspace. 

For example, for graph problems (e.g., [MAGQW, MM]), we view the input 
as a sequence of m (possibly directed) edges between n nodes. Our goal is to find 
algorithms where space requirements can be bounded as a function of n (rather 
than m), or in establishing that the space must grow with m. 

Our goal is to expose dichotomies in the space requirements along three different
axes: (i) between one-pass and multi-pass algorithms, (ii) between deterministic
and randomized algorithms, and (iii) between exact and approximation algorithms.

We first describe some classes of problems which can be described in this con- 
text. 

(1) Systems such as LORE [MAGQW] and WEB SQL [AMM, MM, MMM] 
view a database as a graph/hypergraph. For instance, a directed edge might rep- 
resent a hyperlink on the web, or a citation between scientific papers, or a pair of 
cities connected by a flight; in a database of airline passengers a hyperedge may 
relate a passenger, the airports he uses, and the airline he uses in a flight. Some typ- 
ical queries might be: From which airport in Africa can one reach the most distinct 
airports within 3 hops? (The MAXTOTAL problem below.) Of the papers citing 
the most referenced graphics paper, which has the largest bibliography? (The MAX
problem below.)

We propose four problems that model queries such as those described here: 
Consider a directed multigraph with node set $V_1 \cup V_2 \cup \cdots \cup V_k$, all of whose edges
are directed from a node in $V_i$ to a node in $V_{i+1}$. Let $n = \max_i |V_i|$. The degree of
a vertex is its indegree unless specified otherwise.

The MAX problem. Let $u_1$ be the node of largest outdegree in $V_1$. Let $u_i \in V_i$ be
a node of largest degree among those incident to $u_{i-1}$. Find $u_k$.

The MAXNEIGHBOR problem. Let $u_1$ have largest outdegree in $V_1$. Let $u_i \in V_i$
have the largest number of edges to $u_{i-1}$; determine $u_k$.

The MAXTOTAL problem. Find a node $u_1 \in V_1$ which is connected to the largest
number of nodes of $V_k$.

The MAXPATH problem. Find nodes $u_1 \in V_1$, $u_k \in V_k$ such that they are connected
by the largest possible number of paths.

(2) A second problem class is verifying consistency in databases. For instance, 
check if each customer in a database has a unique address, or if each employee 
has a unique manager/salary. We model these problems as consistency verification 
problems of relations. Let a $k$-ary relation $R$ over $\{0, 1, \ldots, n\}$ be given. Let
$\phi = \forall u_1, u_2, \ldots\ \exists! (v_1, v_2, \ldots) : f(u_1, \ldots, v_1, \ldots)$ for $(u_1, u_2, \ldots, v_1, v_2, \ldots) \in R$.

The Consistency Verification problem. Verify that $R$ satisfies $\phi$.

(3) More traditional graph problems like connectivity arise [BGMZ97] while
analyzing various properties of the web. In database query optimization, estimating
the size of the transitive closure is important [LN89]. This motivates our
study of various traditional graph properties.

(4) As pointed out in [SALP79, AMS96] estimates of the frequency moments 
of a data set can be used effectively for database query optimization. This mo- 
tivates our study of approximate frequency estimation problems and approximate 
selection problems (e.g., find a product whose sales are within 10% of the most 
popular product). 

1.3 Definitions 

Las-Vegas and Monte-Carlo algorithms. A randomized algorithm is an algorithm 
that flips coins, i.e., uses random bits: no probabilistic assumption is made of the 
distribution of inputs. A randomized algorithm is called Las-Vegas if it gives the 
correct answer on all input sequences; its running time or workspace could be a 
random variable depending on the coin tosses. A randomized algorithm is called
Monte-Carlo with error probability $\epsilon$ if on every input sequence it gives the right
answer with probability at least $1 - \epsilon$. If no $\epsilon$ is specified, it is assumed to be 2/3.

Our principal tool for showing lower bounds on the workspace of limited-pass 
algorithms is drawn from the area of communication complexity. 

Communication complexity. Let $X$, $Y$, and $Z$ be finite sets and let $f : X \times Y \to Z$
be a function. The (2-party) communication model consists of two players, A
and B, such that A is given an $x \in X$ and B is given a $y \in Y$ and they want to
compute $f(x, y)$. The problem is that A does not know $y$ and B does not know
$x$. Thus, they need to communicate, i.e., exchange bits according to an agreed-upon
protocol. The communication complexity of a function $f$ is the minimum
over all communication protocols of the maximum over all $x \in X$ and all $y \in Y$
of the number of bits that need to be exchanged to compute $f(x, y)$. The protocol
can be deterministic, Las Vegas or Monte Carlo. Finally, if the communication
is restricted to one player transmitting and the other receiving, then this is termed
one-way communication complexity. In a one-way protocol, it is critical to specify
which player is the transmitter and which the receiver. Only the receiver needs to
be able to compute $f$.

1.4 Related previous work 

Estimation of order statistics and outliers [ARS97, AS95, JC85, RML97, Olk93]
has received much attention in the context of sorting [DNS91], selectivity
estimation [PIHS96], query optimization [SALP79] and in providing online user
feedback [Hel]. The survey by Yannakakis [Yan90] is a comprehensive account of
graph-theoretic methods in database theory. 

Classical work on time-space tradeoffs [Cob66, Tom80] may be interpreted as 
lower bounds on workspace for problems such as verifying palindromes, perfect 
squares and undirected st connectivity. Paterson and Munro [MP80] studied the 
space required in selecting the kth largest out of n elements using at most P passes 
over the data. They showed an upper bound of $n^{1/P} \log n$ and an almost matching
lower bound of $n^{1/P}$ for large enough $k$. Alon, Matias and Szegedy [AMS96]
studied the space complexity of estimating the frequency moments of a sequence
of elements in one pass. In this context, they show (almost) tight upper and lower
bounds for a large number of frequency moments and show how communication 
complexity techniques can be used to prove lower bounds on the space require- 
ments. 

Our model appears at first sight to be closely related to papers on I/O com- 
plexity [HK81], hierarchical memory [AACS87], paging [ST85] and competitive 
analysis [KMRS88], as well as external memory algorithms [VV96]. However, our 
model is considerably more stringent: whereas in these papers on memory management
one can bring back (into fast memory) a data item that was previously evicted
(and is required again), in our model we cannot retrieve items that are discarded.

1.5 Our main results 

We expose the following dichotomies in our model: (i) some problems require
large space in one pass but small space in two; (ii) there can be an
exponential gap in space bounds between Monte Carlo and Las Vegas algorithms;
(iii) if we settle for an approximate solution, we can reduce the space
requirement substantially. Our tight lower bounds for the approximate solution
apply communication complexity techniques to approximation algorithms.

Theorem 1 In one pass, the MAX problem requires $\Omega(kn^2)$ space and has an
$O(kn^2 \log n)$ space solution. In $P > 1$ passes it requires $\Omega(kn/P)$ space and can
be solved in $O((kn \log n)/P)$ space.

Theorem 2 In one pass, the MAXNEIGHBOR, MAXTOTAL, and MAXPATH
problems require $\Omega(kn^2)$ space and have $O(kn^2 \log n)$ space solutions.

Notice however, that unlike the MAX problem, the other three do not seem 
to admit efficient two pass solutions. Resolving this remains an open issue. We 
believe that no constant number of passes will result in substantial savings. 

Let $R$ be a $k = (k_1 + k_2)$-ary relation over $\{1, \ldots, n\}$. Consider the formula

$\phi = \forall u_1 \ldots u_{k_1}\ \exists! (v_1 \ldots v_{k_2}) : f(u_1 \ldots u_{k_1}, v_1 \ldots v_{k_2})$ for $(u_1, \ldots, v_1, \ldots) \in R$

where / is a function assumed to be provided via an oracle. Also suppose that we 
are presented the relation R one tuple at a time. Then, we have the following: 

Theorem 3 Verifying that $R$ satisfies $\phi$ can be done by an $O(\log\frac{1}{\delta} \log n)$ space
Monte Carlo algorithm that outputs the correct answer with probability $1 - \delta$. Any
Las Vegas algorithm that verifies that $R$ satisfies $\phi$ requires at least $\Omega(n^2)$ space.

Theorem 3 shows an exponential gap between Las Vegas and Monte Carlo algo- 
rithms. In Section 4, we describe an algorithm and its analysis; these are easily 
modified to yield Theorem 3 through a completeness property described further in 
Section 4. We also have the following open problem: Let $R$ be a binary relation.
Let $\phi = \forall x\, \exists u : (x, u) \in R$. Is there any sub-linear space Monte Carlo algorithm
that verifies that $R$ satisfies $\phi$?

Theorem 4 Given a sequence of $m$ numbers in $\{1, \ldots, n\}$ with multiple occurrences,
finding the $k$ most frequent items requires $\Omega(n/k)$ space. Random sampling
yields an upper bound of $O(n(\log m + \log n)/k)$.

The proof of Theorem 4 is in Section 5.

The approximate median problem requires finding a number whose rank is in
the interval $[m/2 - \epsilon m, m/2 + \epsilon m]$. It can be solved by a one-pass Monte Carlo
algorithm with error probability 1/10 and $O(\log n (\log 1/\epsilon)^2 / \epsilon)$ space [RML97].
We give a corresponding lower bound in Section 5.

Theorem 5 Any 1-pass Las Vegas algorithm for the approximate median problem
requires $\Omega(1/\epsilon)$ space.

Easy one-pass reductions from the communication complexity of the DISJOINTNESS
function [NK] yield:

Theorem 6 In $P$ passes, the following graph problems on an $n$-node graph all
require $\Omega(n/P)$ space: computing the connected components, $k$-edge connected
components with $1 < k < n$, $k$-vertex connected components with $1 < k < n$,
testing graph planarity. Finding the sinks in a directed graph requires $\Theta(n/P)$
space.

Incremental graph algorithms give one-pass algorithms for all the problems of Theorem 6.
Thus, there are one-pass algorithms for connected components, $k$-edge and
$k$-vertex connectivity with $k < 3$, and planarity testing that use $O(n \log n)$ space.
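
As an illustration of the incremental approach, the following is a minimal sketch (not taken from the paper) of a one-pass connected-components computation with a union-find structure. It keeps one parent pointer per node, i.e. on the order of n log n bits, no matter how many edges the stream contains. The function name `count_components` and the 0-based node labels are assumptions made for this example.

```python
def count_components(n, edge_stream):
    """One pass over the edges; the only state is a parent array of n entries."""
    parent = list(range(n))

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = n
    for u, v in edge_stream:          # each edge is read exactly once
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv           # merge the two components
            components -= 1
    return components
```

For instance, `count_components(5, iter([(0, 1), (1, 2), (3, 4)]))` reports two components; repeated edges do not change the state.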

Theorem 7 For any $1 > \epsilon > 0$, estimating in one pass the size of the transitive
closure to within a factor of $\epsilon$ requires space $\Omega(m)$.

We prove this theorem in Section 5. Computing the exact size of the transitive
closure requires $O(m \log n)$ space.

The lower bounds of Theorems 1, 2, 6 and 7 hold even for Monte Carlo algorithms
with error probability $\epsilon$, for a sufficiently small $\epsilon$.

All our lower bounds are information-theoretic, placing no bounds on the com- 
putational power of the algorithms. The upper bounds, on the other hand, are all 
"efficient": in all cases, the running time is about the same as the space usage. 

2 Three lower bounds from communication complexity 

Many of the lower bounds in our model build on three lower bounds in communi- 
cation complexity. We review these lower bounds in this section. 

Bit-Vector Probing. Let A have a bit-vector $x$ of length $m$. Let B have an index
$0 < i < m$. B needs to know $x_i$, the $i$th input bit. The only communication
allowed is from A to B.

There is no better method for A to communicate $x_i$ to B than to send the
entire string $x$. More precisely, any algorithm that succeeds in B guessing $x_i$
correctly with probability better than $(1 + \epsilon)/2$ requires at least $\epsilon m$ bits of
communication [NK].

Bit-Vector Comparison. Let A and B have bit-vectors $x$ and $y$ respectively, each
of length $m$. B wishes to verify that $x = y$.

Any deterministic or Las Vegas algorithm that successfully solves this problem
must essentially send the entire string $x$ from A to B, or vice versa.
More precisely, any algorithm that outputs the correct answer with probability
at least $\epsilon$ and never outputs the wrong answer must communicate $\epsilon m$
bits [NK].

Bit-Vector Disjointness. Let A and B have bit-vectors $x$ and $y$ respectively, each
of length $m$. B wishes to find an index $i$ such that $x_i = 1$ and $y_i = 1$.

There is no better protocol than to essentially send the entire string $x$ from
A to B, or vice versa. More precisely, any algorithm that outputs the correct
answer with probability at least $1 - \epsilon$ (for some small enough $\epsilon$) must
communicate $\Omega(m)$ bits [NK].

Notice that the second theorem is weaker than the first and the third in some re- 
spects: it does not apply to Monte Carlo algorithms. There is a good reason: there 
is a Monte Carlo algorithm that does much better, i.e. communicates only $O(\log n)$
bits. On the other hand, the first theorem is weaker than the second and third in 
some respects: it insists that there be no communication in one of the two direc- 
tions. This too is for good reason: B could send A the index, and then A could 
respond with the bit. For a description of these and other issues in this area, see 
[NK]. 
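
To make the Monte Carlo remark concrete, here is a hedged sketch of one standard fingerprinting protocol for bit-vector comparison; it is an illustration and not necessarily the protocol the authors have in mind. A picks a random prime p below m^2 and sends p together with x mod p; B accepts if y mod p matches. The message has a logarithmic number of bits, equal vectors are always accepted, and unequal vectors are wrongly accepted only when p divides x - y. The helper names and the prime range are assumptions of this example.

```python
import random

def is_prime(q):
    """Trial division; adequate for the small primes used here."""
    if q < 2:
        return False
    d = 2
    while d * d <= q:
        if q % d == 0:
            return False
        d += 1
    return True

def random_prime(bound, rng):
    """Random prime below `bound` (bound >= 4) by rejection sampling."""
    while True:
        q = rng.randrange(2, bound)
        if is_prime(q):
            return q

def alice_message(x_bits, rng):
    """A's single message: a random prime p and the residue of x modulo p."""
    m = len(x_bits)
    p = random_prime(max(m * m, 4), rng)
    x = int("".join(map(str, x_bits)), 2)
    return p, x % p

def bob_accepts(y_bits, message):
    """B compares his own residue with the one received from A."""
    p, fingerprint = message
    y = int("".join(map(str, y_bits)), 2)
    return y % p == fingerprint
```

The error is one-sided, which is why this does not contradict the Las Vegas lower bound above.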

3 One pass versus many passes 

Our goal in this section is to outline the proof of Theorem 1 showing that some 
problems require large space in one pass but small space in two. We give here a 
lower bound of $\Omega(n^2)$ on the space used by any Monte Carlo one-pass algorithm
for the 2-layer MAX problem; a somewhat more elaborate construction (omitted
here) yields a lower bound of $\Omega(kn^2)$ for the $k$-layer version.

Proof of Theorem 1. We provide a reduction from the bit-vector probing problem.
Denote the node-set on the left of the bipartite graph by $U$, and the node-set
on the right by $V$, where $|U| = |V| = n$. We further partition $U$ into $U_1, U_2$ where
$|U_1| = n/3 = \sqrt{m}$. Likewise we partition $V$ into $V_1, V_2$ where $|V_1| = n/3$. The
bit-string $x$ is interpreted as specifying the edges of a bipartite graph on $U_1 \times V_1$ in
the natural way, i.e. the edge $(u, v)$ corresponds to index $u\sqrt{m} + v$. On getting the
query $i$, we translate it to an edge $(u, v)$, $u \in U_1$, $v \in V_1$, and augment the graph
with edges $(u, v')$ and $(u', v)$ for each $u' \in U_2$ and $v' \in V_2$. The answer to the
MAX problem on this graph is $v$ if and only if the edge $(u, v)$ is in the bipartite
graph, i.e. the $i$th input bit is set.
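
The reduction can be checked mechanically on small inputs. The sketch below is an illustration under assumed conventions (0-based node labels, a brute-force solver for the 2-layer MAX problem, hypothetical function names): it builds the graph from a bit string and a query index and tests whether MAX returns the queried node.

```python
import math
from collections import Counter

def max_two_layer(edges):
    """Brute-force 2-layer MAX: highest-indegree neighbour of the max-outdegree node."""
    outdeg = Counter(u for u, _ in edges)
    indeg = Counter(v for _, v in edges)
    u1 = max(outdeg, key=outdeg.get)
    return max((v for u, v in edges if u == u1), key=indeg.get)

def probe_via_max(x, i):
    """Answer 'is bit i of x set?' by solving MAX on the constructed graph."""
    t = math.isqrt(len(x))                 # t = sqrt(m) = n/3
    assert t * t == len(x)
    n = 3 * t
    # A's part of the stream: edges on U1 x V1 encoded by the bit string.
    edges = [(a, b) for a in range(t) for b in range(t) if x[a * t + b]]
    # B's part: padding edges derived from the query (u, v).
    u, v = divmod(i, t)
    edges += [(u, v2) for v2 in range(t, n)]   # u to every node of V2
    edges += [(u2, v) for u2 in range(t, n)]   # every node of U2 to v
    return max_two_layer(edges) == v
```

The padding makes u the unique node of largest outdegree and gives v an indegree larger than any other node, so v is returned exactly when it is a neighbour of u.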

The MAX problem is solved in 2 passes with space $O(kn \log n)$, even on $k$-layered
graphs: In the first pass find the degree of each vertex. In the second pass
determine the highest degree neighbor in $V_i$ for each vertex in $V_{i-1}$. Then compute
$u_1$ and repeatedly find the highest degree neighbor of the current node until $u_k$ is
determined. This algorithm can be modified to use only space $O(kn \log n / P)$ in $P$
passes.
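
A minimal sketch of the two-pass procedure follows. The stream format (triples `(i, u, v)` for an edge from `u` in layer i to `v` in layer i+1), the callable `make_pass` that restarts the stream, and the function name are assumptions of this example; it also assumes every node on the chain has at least one outgoing edge.

```python
from collections import defaultdict

def max_two_pass(make_pass, k):
    """make_pass() yields (i, u, v): an edge from u in V_i to v in V_{i+1}."""
    # Pass 1: outdegrees in V_1 and indegrees in every later layer.
    outdeg1 = defaultdict(int)
    indeg = defaultdict(int)             # key: (layer, node)
    for i, u, v in make_pass():
        if i == 1:
            outdeg1[u] += 1
        indeg[(i + 1, v)] += 1
    # Pass 2: for every node, remember its highest-degree neighbour.
    best = {}                            # key: (layer, node) -> neighbour
    for i, u, v in make_pass():
        cur = best.get((i, u))
        if cur is None or indeg[(i + 1, v)] > indeg[(i + 1, cur)]:
            best[(i, u)] = v
    # Follow the chain u_1, u_2, ..., u_k using only the stored tables.
    node = max(outdeg1, key=outdeg1.get)
    for i in range(1, k):
        node = best[(i, node)]
    return node
```

The state consists of one counter and one pointer per node, which is where the kn log n term comes from.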

Note that the lower bound proof that we provide above applies to approximate
versions of the MAX problem as well, namely, it requires $\Omega(n^2)$ space to compute
a near max degree neighbor of a vertex with near max degree in $U$.

Proof of Theorem 2. The proofs for all three problems are reductions from the 
bit-vector probing problem similar to the proof of Theorem 1. 

To show the bound for the MAXNEIGHBOR problem we construct the same
initial graph as in the proof of Theorem 1, but double each edge from $V_1$ to $U_1$. On
getting the query $i$, we translate it into an edge $(u, v)$, $u \in U_1$, $v \in V_1$, and add
this edge to the graph. Additionally we augment the graph with two edges $(u, v')$
for each $v' \in V_2$. Then there are three edges between $u$ and $v$ iff the $i$th input bit
is set; otherwise there is only one edge between $u$ and $v$. Thus, $v$ is returned iff the
$i$th input bit is set.

For the MAXTOTAL problem construct a tripartite graph with node set $U \cup V \cup W$,
where $|U| = |V| = |W| = \sqrt{m} + 1$. The nodes in set $U$ are numbered from
1 to $\sqrt{m} + 1$. The same holds for set $V$ and set $W$. As in the proof of Theorem 1,
the bit-string $x$ is translated into edges from $U$ to $V$ as follows: edge $(u, v)$ exists
in the graph iff index $u\sqrt{m} + v$ of $x$ is set. Additionally there is an edge from node
$\sqrt{m} + 1$ in $U$ to node $\sqrt{m} + 1$ in $V$ and from the latter node to node $\sqrt{m} + 1$ in
$W$. On getting a query $i$ we translate it into an edge $(u, v)$ with $u \in U$ and $v \in V$
and augment the graph by edges $(v, w)$ for each $w \in W$. Then $u$ reaches the most
nodes in $W$ iff the $i$th input bit was set; otherwise node $\sqrt{m} + 1$ of $U$ reaches the
most nodes of $W$.

For the MAXPATH problem augment the graph for the MAXTOTAL problem
with a fourth node set $X$ and connect every node of $W$ with an edge to the same
node $x$ of $X$. Using the same reduction for a query as for problem MAXTOTAL
shows that $u$ and $x$ are connected by the largest number of paths iff the $i$th input
bit was set; otherwise node $\sqrt{m} + 1$ and $x$ are connected by the largest number of
paths.

4 Las Vegas versus Monte Carlo 

In this section we present an exponential gap between Las Vegas and Monte Carlo
one-pass algorithms. The symmetry property, given a sequence of ordered pairs over
$\{1, \ldots, n\}$, is that $(u, v)$ is in the sequence if and only if there is a unique $(v, u)$ in
it. In this section, we show an $O(\log n)$ space Monte Carlo algorithm to verify the
symmetry property. By contrast, any one-pass Las Vegas algorithm requires $\Omega(m)$
space, where $m$ is the size of the relation.

Algorithm. Choose $p$, a random prime smaller than $n^3$. Let $l_{u,v} = n^{2(nu+v)}$
if $u < v$, $l_{u,v} = -n^{2(nv+u)}$ if $u > v$, and 0 otherwise. Compute the sum
$s = \sum_{(u,v) \in R} l_{u,v}$ modulo $p$. Check if $s = 0 \bmod p$ at the end. Storing $s$ mod
$p$ requires only $\log p < 3 \log n$ space. Also check that there are no more than $n^2$
edges in all.
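
The algorithm can be sketched as below. This is an illustration only: it assumes the weights are $n^{2(nu+v)}$ for a pair with $u < v$ and the negated weight of the reversed pair when $u > v$, which is one reading of the definition above, and the function names are hypothetical. Only the running sum modulo p and an edge counter are kept between stream items.

```python
import random

def is_prime(q):
    """Trial division; adequate for primes below n**3 with small n."""
    if q < 2:
        return False
    d = 2
    while d * d <= q:
        if q % d == 0:
            return False
        d += 1
    return True

def random_prime_below(bound, rng):
    """Random prime below `bound` (bound >= 4) by rejection sampling."""
    while True:
        q = rng.randrange(2, bound)
        if is_prime(q):
            return q

def looks_symmetric(pairs, n, p):
    """One pass over the pairs; True if the fingerprint is 0 modulo p."""
    s = 0
    count = 0
    for u, v in pairs:
        count += 1
        if u < v:
            s = (s + pow(n, 2 * (n * u + v), p)) % p   # +weight for (u, v)
        elif u > v:
            s = (s - pow(n, 2 * (n * v + u), p)) % p   # -weight for the reverse
    return s == 0 and count <= n * n
```

A caller would draw `p = random_prime_below(n ** 3, rng)` once before reading the stream. A symmetric input is accepted for every p; an asymmetric one is accepted only if p happens to divide the nonzero sum.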

Theorem 8 The above algorithm will output a correct response with probability
at least $1 - (2\log^2 n / n)$. Moreover, any one-pass Las Vegas algorithm that outputs
the correct response with probability 2/3 or more uses $\Omega(n^2)$ space.

Proof. It is easily seen that $\sum_{(u,v) \in R} l_{u,v}$ is 0 if and only if $R$ is symmetric. On the
other hand, it follows from the Chinese Remainder Theorem and the fact that there
are at least $n^3/\log n$ primes smaller than $n^3$, that the probability that a nonzero sum
evaluates to 0 modulo a random prime is smaller than $2\log^2 n/n$. This follows,
since $s$ could be 0 modulo $p$ for at most $2n^2 \log n$ of them since $s < 2^{2n^2 \log n}$.

The lower bound follows from a reduction from the bit-vector comparison
problem of communication complexity. Assume that one player has a graph with each edge directed
from the smaller numbered vertex to the larger numbered one. The second player
has a graph with each edge directed the other way. The union of the two inputs is
symmetric if and only if the two graphs are the same. Consider a truth table whose
rows are indexed by all the possible inputs of the first player, and the columns are
all possible inputs of the second player; each entry is the output corresponding to
the players' inputs for that row and column. The truth table corresponding to the
symmetry relation under our class of inputs has $2^{\binom{n}{2}}$ distinct rows, one corresponding
to each possible graph. By a technique from communication complexity [NK],
we require at least $c\binom{n}{2}$ bits of communication in any algorithm that purports to solve this
problem with probability $(1 + c)/2$. Consider the following input data stream: we
present the first player's ordered pairs followed by the second player's. The state of
the data stream algorithm after the first half represents the communication between
the players. ■
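
The last step of this argument, turning a one-pass algorithm into a one-way protocol, can be sketched generically. In the illustration below (all names are hypothetical), a one-pass algorithm is modelled as a function `step(state, item)`; A runs it over her part of the stream and the resulting state is the entire message to B, so a lower bound on the message length is a lower bound on the workspace.

```python
from functools import reduce

def run(step, state, items):
    """Feed items one at a time; only `state` survives between items."""
    return reduce(step, items, state)

def one_way_protocol(step, initial_state, alice_items, bob_items):
    """Simulate the stream 'A's items, then B's items' as a one-way protocol."""
    message = run(step, initial_state, alice_items)   # sent from A to B
    final_state = run(step, message, bob_items)       # B resumes from it
    return final_state, message

def mark_seen(mask, item):
    """Toy one-pass algorithm: a bit mask of the values seen so far."""
    return mask | (1 << item)
```

For the toy algorithm, `one_way_protocol(mark_seen, 0, [0, 2], [2, 3])` yields the same final state as a single run over the concatenated stream, with the intermediate mask as the message.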

More interesting, however, is a completeness property that arises here: every
problem of the form in Theorem 3 can be reduced to the symmetry
problem.

Proof of Theorem 3. The reduction works as follows: we encode each tuple
$(u_1, \ldots, u_{k_1})$ as an index $i$ using a standard Gödel encoding, $g$; let this encoding
range from 1 through $m$. Then we output the ordered pair $(i, 0)$ for each $1 \le i \le m$.
Then for each tuple $(u_1, \ldots, v_1, \ldots)$ we output $(0, g(u_1, \ldots, u_{k_1}))$ if and only if
$f(u_1, \ldots, v_1, \ldots)$. The resultant graph is symmetric if and only if the relation $R$
satisfies $\phi$.

5 Exact versus approximate computation 

In this section, we show that if we settle for an approximate solution, we can reduce 
the space requirement substantially. Our matching lower bounds for the approxi- 
mate solution require a generalization of communication complexity techniques to 
approximation algorithms. 

Proof of Theorem 4. Alon et al. show that finding the mode (i.e., the most
frequently-occurring number) of a sequence of $m$ numbers in the range $\{1, \ldots, n\}$
requires space $\Omega(n)$. By a simple reduction (replace each number $i$ in the original
sequence by a sequence of $k$ numbers $ki + 1, ki + 2, \ldots, ki + k$) it follows that
finding one of the $k$ most frequent items in one pass requires space $\Omega(n/k)$.

The almost matching upper bound is given by the following Monte-Carlo algorithm
that succeeds with constant probability: before the start of the sequence sample
each number in the range with probability $1/k$ and then only keep a counter for
the successfully sampled numbers. Output the successfully sampled number with
largest count. With constant probability one of the $k$ most frequent numbers has
been sampled successfully. This needs $O(n(\log m + \log n)/k)$ space.
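
A minimal sketch of this sampling scheme, with hypothetical names and an explicit random source passed in by the caller:

```python
import random

def sampled_frequent_item(stream, n, k, rng):
    """Return the sampled value in 1..n with the largest count, or None."""
    # Sample candidate values before reading the stream, each with prob. 1/k.
    counters = {v: 0 for v in range(1, n + 1) if rng.random() < 1.0 / k}
    for item in stream:                  # one pass; unsampled items are ignored
        if item in counters:
            counters[item] += 1
    if not counters:
        return None                      # nothing was sampled
    return max(counters, key=counters.get)
```

About n/k counters are kept in expectation, each holding a label of log n bits and a count of at most log m bits. With `k = 1` every value is sampled and the routine degenerates to exact mode finding.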

Proof of Theorem 5. We show that any algorithm that solves the $\epsilon$-approximate
median problem requires $\Omega(1/\epsilon)$ space. The proof follows from a reduction from
the bit-vector probing problem. Let $b_1, b_2, \ldots, b_n$ be a bit vector followed by a query
index $i$. This is translated to a sequence of numbers as follows: First output $2j + b_j$
for each $j$. Then on getting the query, output $n - i - 1$ copies of 0 and $i + 1$ copies
of $2(n + 1)$. It is easily verified that the least significant bit of the exact median of
this sequence is the value of $b_i$. Choose $\epsilon$ on the order of $1/n$, small enough that the
$\epsilon$-approximate median is the exact median. Thus, any one-pass algorithm that requires
fewer than $n$ bits of memory (with $n$ proportional to $1/\epsilon$)
can be used to derive a communication protocol that requires fewer
than $n$ bits to be communicated from A to B in solving bit-vector probing. Since
every protocol that solves the bit-vector probing problem must communicate $n$ bits,
this is a contradiction.
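
The encoding can be verified exhaustively for small n. The sketch below fixes conventions that the text leaves open and that are assumptions of this example: bits and the query index are 0-based, and the median of the 2n numbers is taken to be the lower median (rank n).

```python
def stream_for_probe(bits, i):
    """Numbers whose lower median encodes bits[i] (0-based index i)."""
    n = len(bits)
    alice = [2 * j + b for j, b in enumerate(bits)]      # A's part: 2j + b_j
    bob = [0] * (n - i - 1) + [2 * (n + 1)] * (i + 1)    # B's padding
    return alice + bob

def probe_via_median(bits, i):
    """Recover bits[i] from the least significant bit of the exact median."""
    seq = sorted(stream_for_probe(bits, i))
    lower_median = seq[len(seq) // 2 - 1]                # rank n among 2n items
    return lower_median & 1
```

The padding shifts the sorted sequence so that the element of rank n is exactly 2i + b_i.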

We next prove Theorem 6. 

Proof of Theorem 6. We reduce the bit-vector disjointness problem to the graph
connectivity problem. Construct a graph whose node-set is $\{a, b, 1, 2, \ldots, n\}$.
Insert an edge $(a, i)$ if bit $i$ is set in A's vector, and insert an edge $(b, i)$ if bit $i$ is
set in B's vector. Now, $a$ and $b$ are connected in the graph if and only if there
exists a bit that is set in both A's vector and B's vector. By the lower bound for the
bit-vector disjointness problem, every protocol must exchange $\Omega(n)$ bits between
A and B. Thus, if there are $P$ passes over the data, one of the passes must use at
least $\Omega(n/P)$ space. The reduction for $k$-edge or $k$-vertex connectivity follows by
adding $k - 1$ nodes $c_1, \ldots, c_{k-1}$ and an edge from each $c_j$, $1 \le j \le k - 1$, to both
$a$ and $b$.
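
As a sanity check of the basic connectivity reduction, the following sketch (hypothetical names, 0-based bit positions) builds the graph from two bit vectors and tests whether a and b end up in the same component.

```python
def connected_after_reduction(x, y):
    """Build the graph on {a, b, 0..n-1} and test whether a reaches b."""
    n = len(x)
    a, b = n, n + 1                       # two extra nodes for a and b
    parent = list(range(n + 2))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = [(a, i) for i in range(n) if x[i]]       # A's half of the stream
    edges += [(b, i) for i in range(n) if y[i]]      # B's half of the stream
    for u, v in edges:
        parent[find(u)] = find(v)
    return find(a) == find(b)
```

Every path between a and b must pass through some node i adjacent to both, so the result equals "the vectors share a set bit".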

To reduce to planarity testing we add four nodes $c_1, c_2, c_3, c_4$ and connect them
pairwise. Additionally we add the edges $(c_1, a)$, $(c_2, a)$, $(c_3, a)$, and $(c_4, b)$. Then
the graph contains $K_5$ as a minor if and only if $a$ and $b$ are connected.

We also reduce the bit-vector disjointness problem to the problem of deciding
whether the graph contains a sink. Construct a graph whose node-set is $\{a, b, 1, 2, \ldots, n\}$.
Insert edges $(a, b)$ and $(b, a)$ to guarantee that neither of them is a sink. If bit $i$ is
set in A's vector, insert an edge $(a, i)$, otherwise insert an edge $(i, a)$. Similarly, if
bit $i$ is set in B's vector, insert an edge $(b, i)$, otherwise insert an edge $(i, b)$. Now
node $i$ is a sink if and only if bit $i$ is set in both A's and B's vector. It follows
that the graph contains a sink if and only if there exists a bit that is set in both A's
vector and B's vector. By the lower bound for the bit-vector disjointness problem,
every protocol must exchange $\Omega(n)$ bits between A and B. Thus, if there are $P$
passes over the data, one of the passes must use at least $\Omega(n/P)$ space.

A $P$-pass algorithm that keeps a bit for nodes $(i - 1)n/P, (i - 1)n/P + 1,
\ldots, in/P - 1$ in pass $i$ indicating whether an edge leaving the node was read gives
the desired upper bound.
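
A sketch of this P-pass scheme is given below. The callable `make_pass`, which restarts the edge stream, the 0-based node labels and the rounding of the block size are assumptions of the example.

```python
def find_sinks(make_pass, n, P):
    """Sinks of a digraph on nodes 0..n-1, using about n/P bits per pass."""
    block = -(-n // P)                       # ceil(n / P) nodes per pass
    sinks = []
    for i in range(P):
        lo, hi = i * block, min((i + 1) * block, n)
        has_out = [False] * max(hi - lo, 0)  # one bit per node of this block
        for u, _ in make_pass():             # pass number i + 1
            if lo <= u < hi:
                has_out[u - lo] = True
        sinks.extend(lo + j for j, flag in enumerate(has_out) if not flag)
    return sinks
```

The list of sinks is the output rather than workspace; if only the existence of a sink matters, a single flag suffices.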

We also provide a lower bound here for the transitive closure problem. 

Proof of Theorem 7. We reduce the bit-vector probing problem to the transitive
closure estimation problem. Let $d > 1$ be a constant. Given a bit-vector of length
$m$, we construct a graph $G$ on $2(dm + \sqrt{m})$ vertices, $V_i$, $1 \le i \le 4$, with
$|V_2| = |V_3| = \sqrt{m}$ and $|V_1| = |V_4| = dm$, such that edge $(i, j)$ with $i \in V_2$ and $j \in V_3$
exists iff entry $i\sqrt{m} + j$ is set in the vector. To test whether entry $i\sqrt{m} + j$ is set,
add edges from each vertex in $V_1$ to $i \in V_2$ and from $j \in V_3$ to each vertex in $V_4$.
The size of the transitive closure is larger than $m$ if and only if the edge $(i, j)$ is in
the graph. Furthermore, for $\epsilon < 1 - 2/(d^2 + 1)$, any $\epsilon$-approximation algorithm
for the transitive closure can answer a query correctly. Thus, any $\epsilon$-approximation
algorithm must use $\Omega(m)$ space.

6 Further work 

Our work raises a number of directions for further work; we list some here: 

1. We need more general techniques for both lower and upper bounds when 
multiple passes can be performed over the data. They might also imply in- 
teresting new results about communication complexity. From a practical 
perspective, algorithms are needed for a wider class of problems than the 
selection problem that has been extensively studied [ARS97, AS95, JC85,
RML97, Olk93].

2. Can we design algorithms that minimize the number of passes performed 
over the data given the amount of memory available? This would be use- 
ful when, for instance, the number of active concurrent threads governs the 
memory available at runtime.

3. How can we arrange the data physically in a linear order with the express 
goal of optimizing the memory required to process some set of queries? 
Recall that the results of a query may not necessarily be physically con- 
tiguous (e.g., in the database of airports, the subset from Africa may not be 
together; more generally, we will have to cope with the results of some class 
of SELECT and GROUP BY operations). Can we model the class of "likely"
queries and use it to drive the data layout? 

4. From a theoretical perspective, we have highlighted the importance of study- 
ing the communication complexity of approximation problems (as in our 
bounds for the approximate solutions of selection and transitive closure); 
existing work only treats computations that yield exact answers. 